My last blog post examined the CPU-only performance of several local LLM inference servers on an Oracle Ampere A1 instance. To keep things simple, we used a fixed prompt and measured single-request latency (TTFT and total time) and throughput (TPS) for each server. We also validated the performance gains from quantized models. Finally, we looked at resource usage and saw that only the memory footprint is dominated by model size, while CPU usage didn't differ much among tools and models.
In this post, let's expand our benchmark to quantify whether prompt length affects performance in our CPU-only environment. We will:
- look at the bottlenecks in CPU-based inferencing
- explore the performance difference between fixed and varied prompts
Performance bottlenecks: Memory vs. Compute
LLM inference is divided into two distinct phases, each with a different bottleneck profile for CPU-based inferencing:
| Feature | Prompt Evaluation (Prefill) | Token Generation (Decode) |
|---|---|---|
| Goal | Generate the first token and build the KV Cache. | An autoregressive loop, predicting the next token based on current context. |
| Core Operation | Matrix-Matrix Multiplication (GEMM). | Matrix-Vector Multiplication (GEMV) for each new token, loading all model weights from RAM for minimal computation relative to the transferred data size. |
| Parallelism | Highly parallelizable (all tokens in the input prompt are computed at once). | Strictly sequential (each token depends on the last). |
| CPU/Memory Usage | High CPU utilization. High initial memory demand to load all parameters and temporary matrices. | High memory-bandwidth demand: for every new token, the CPU streams weights and the KV cache from main RAM into caches and compute units. Low CPU utilization: the CPU mostly sits idle, waiting for the memory controller to deliver the next chunk of parameters. |
| Bottleneck | Compute-bound, especially for short/medium prompts. Optimized heavily by llama-server thread flags (--threads-batch). Can also be memory-bound depending on the model/context size. | Memory-bandwidth bound: limited by how fast RAM delivers data, not how fast the CPU cores can calculate (FLOPs), and the former is always slower than the latter. This determines the TPS ceiling. |
| Metric Impacted | Time To First Token (TTFT) | Time Per Output Token (TPOT) and TPS |
The qwen2.5-3b-q4_k_m model is about 2.5 GB in size, which easily fits in our 24 GB of RAM. How fast our 4 A1 CPU cores can read the entire model from RAM for every single generated token determines the bottleneck for token generation (TPS). The goal is to ensure the model is loaded efficiently and that our 4 CPU cores are fully utilized without over-threading.
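This memory-bandwidth ceiling can be sketched with back-of-the-envelope arithmetic. The ~30 GB/s effective bandwidth figure below is an assumed, illustrative number for a 4-core A1 shape, not a measured one:

```python
def decode_tps_ceiling(model_size_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Rough decode-phase TPS upper bound: each generated token must stream
    (roughly) all model weights from RAM, so bandwidth / model size caps TPS."""
    return mem_bandwidth_bytes_per_s / model_size_bytes

# Assumed figures: ~2.5 GB quantized model, ~30 GB/s effective memory bandwidth
ceiling = decode_tps_ceiling(2.5e9, 30e9)
print(f"Theoretical decode ceiling: ~{ceiling:.0f} tokens/s")
```

Under these assumed numbers the ceiling lands in the low double digits of tokens per second, which is the right ballpark for the TPS figures we measure later, consistent with decode being memory-bandwidth bound.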
Benchmark Improvements
Here are the features of my V1 benchmark script, and the V2 improvements in this post:
| Feature | V1 | V2 |
|---|---|---|
| Prompt | Fixed prompt - limits variability | Varies prompt lengths and types for more realistic loads; also includes a mode with fixed input and output tokens. |
| Output | Averages only | Adds stddev for variability, percentiles (e.g., p50/p95 latency), and token counts for context. |
| Token separation | No input/output token separation | Uses tiktoken to measure prompt tokens vs. generated tokens separately for better insight. |
To isolate our test further, we will only run llama-server in this post. We will open both the OCI public subnet's ingress rule and the VM-level firewall to accept inbound traffic on the port llama-server listens on. This skips the nginx overhead (though not ideal for a production environment). In addition, we will run our benchmark scripts on another machine to avoid resource contention with llama-server.
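Before launching the benchmark from the second machine, it is worth confirming that the ingress rule and VM firewall actually expose the port. A minimal reachability check, assuming a placeholder host and the port 7775 used in the script below:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with your VM's public IP before running:
# print(port_open("<LLAMA.CPP HOST>", 7775))
```

If this returns False, check the subnet's security list and the VM firewall (e.g., firewalld/iptables rules) before blaming the server.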
Fixed vs. Varied Prompts
We will first benchmark sequential (single) requests with both fixed and varied prompts. A fixed prompt lets us establish a baseline for our hardware, engine, and model choice, whereas varied prompts strive to simulate real-world usage.
Fixed prompt: isolate infra performance
In LLM inference benchmarking (e.g., MLPerf Inference, Hugging Face's Open LLM Leaderboard, or tools like lm-evaluation-harness), fixing input length (e.g., padding/truncating prompts to exactly 512 tokens) and output length (e.g., setting max_tokens=128 with temperature=0 for deterministic short responses) is a deliberate design choice.
- Concept: Use one simple, standardized prompt (e.g., “Explain memory caching in one sentence”) across all runs. Output length is also fixed.
- Purpose: To establish a maximum theoretical TPS baseline that isolates the raw performance of the underlying inference infrastructure (hardware, model, serving framework). By eliminating prompt variability, we focus on the system's efficiency in handling a known workload.
- Cache utilization: Essential for measuring the effectiveness of KV cache reuse for identical or similar prefixes.
- Reproducibility: Variable lengths introduce noise from model-specific behaviors (e.g., early EOS tokens in shorter generations). Fixed lengths ensure identical workloads per run, minimizing variance from hardware jitter or caching.
Code
We first import all necessary libraries and define common constants.
import openai
import time
import os
import statistics
import asyncio
import csv
import random
from typing import List, Dict, Any
import tiktoken
import numpy as np

# --- CONFIGURATION ---
SERVICES = {
    "llama.cpp": {
        "base_url": "http://<LLAMA.CPP HOST>:7775/v1",
        "model": "qwen2.5:3B-Q4_K_M",
        "api_key": "llama"
    },
}
MAX_TOKENS = 50
LOREM_IPSUM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

We then define core utility functions.
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Accurate token counting using tiktoken."""
    try:
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # Rough fallback (~1.3 tokens per word) if tiktoken is unavailable
        return int(len(text.split()) * 1.3)

def generate_lorem_prompt(target_tokens: int) -> str:
    """Generate a dummy prompt padded to an exact token count."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt = LOREM_IPSUM
    while len(enc.encode(prompt)) < target_tokens:
        prompt += " " + LOREM_IPSUM
    tokens = enc.encode(prompt)
    exact_tokens = tokens[:target_tokens]
    return enc.decode(exact_tokens)
def benchmark_single_request(client: openai.OpenAI, model_name: str, prompt: str, max_tokens: int = MAX_TOKENS) -> Dict[str, Any]:
    """Synchronous single-request benchmark."""
    start_time = time.time()
    first_token_time = None
    full_response = ""
    # Fixed mode caps output at 128 tokens; use temperature 0 there for determinism
    is_fixed_mode = max_tokens <= 128
    try:
        stream = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.0 if is_fixed_mode else 0.7,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_time is None:
                    first_token_time = time.time()
                full_response += chunk.choices[0].delta.content
        prompt_tokens = count_tokens(prompt)
        generated_tokens = count_tokens(full_response)
        end_time = time.time()
        total_time = end_time - start_time
        ttft = (first_token_time - start_time) * 1000 if first_token_time else total_time * 1000
        generation_time = end_time - first_token_time if first_token_time else 1e-6
        tps = generated_tokens / generation_time if generated_tokens > 0 else 0
        return {
            "prompt_tokens": prompt_tokens,
            "generated_tokens": generated_tokens,
            "ttft_ms": ttft,
            "total_time_s": total_time,
            "tps": tps,
            "status": "Success"
        }
    except Exception as e:
        return {"error": str(e), "prompt_tokens": count_tokens(prompt), "status": "Failed"}

We then define our benchmark modes:
class BenchmarkMode:
    def __init__(self, name: str, runs: int, max_tokens: int):
        self.name = name
        self.runs = runs
        self.max_tokens = max_tokens

    def generate_prompts(self) -> List[Dict[str, Any]]:
        raise NotImplementedError

class FixedMode(BenchmarkMode):
    def generate_prompts(self) -> List[Dict[str, Any]]:
        # Fixed prompts: same prompt length, same desired output length
        FIXED_INPUT_TOKENS = 256
        FIXED_OUTPUT_TOKENS = 128
        prompt = generate_lorem_prompt(FIXED_INPUT_TOKENS)
        prompt_list = []
        for _ in range(self.runs * 3):  # repeat the fixed prompt (15 times with runs=5) for stability
            prompt_list.append({
                "prompt": prompt,
                "max_tokens": FIXED_OUTPUT_TOKENS,
                "mode": f"{self.name}_FixedInput_FixedOutput"
            })
        return prompt_list

Finally, we define our main execution flow:
def run_standard_benchmark(client: openai.OpenAI, model_name: str, mode: BenchmarkMode) -> List[Dict[str, Any]]:
    """Runs synchronous single requests for Variable and Fixed modes (includes 1 warmup)."""
    all_runs = mode.generate_prompts()
    print(f"\n--- {mode.name} ({len(all_runs)} runs total) ---")
    # 1. Warmup (use the first prompt in the list)
    try:
        warmup_prompt_data = all_runs[0]
        benchmark_single_request(client, model_name, warmup_prompt_data["prompt"], warmup_prompt_data["max_tokens"])
        print("Warmup done.")
    except Exception as e:
        print(f"Warmup failed: {e}")
        return []
    # 2. Measurement runs
    results = []
    for i, run_data in enumerate(all_runs):
        print(f"Run {i+1}/{len(all_runs)}...", end="")
        result = benchmark_single_request(client, model_name, run_data["prompt"], run_data["max_tokens"])
        # Ensure all keys are present, even on failure
        if result.get("status") == "Failed":
            print(f" FAILED: {result.get('error', 'Unknown API Error')}")
            # Pad the dictionary with default values for aggregation/logging
            result["prompt_tokens"] = result.get("prompt_tokens", count_tokens(run_data["prompt"]))
            result["generated_tokens"] = 0
            result["ttft_ms"] = 0.0
            result["total_time_s"] = 0.0
            result["tps"] = 0.0
        else:
            print(f" TTFT: {result['ttft_ms']:.2f}ms, TPS: {result['tps']:.2f}")
        # Use the guaranteed 'prompt_tokens' key
        result["mode"] = run_data["mode"]
        result["prompt_type"] = f"Input:{result['prompt_tokens']} Output:{run_data['max_tokens']}"
        results.append(result)
    return results
def generate_summary(mode_name: str, results: List[Dict[str, Any]]):
    """Generates and prints the summary statistics for a set of benchmark results."""
    # 1. Filter out failed runs
    success_results = [r for r in results if r.get("status") == "Success"]
    if not success_results:
        print(f"\nllama.cpp Summary ({mode_name}): All runs failed.")
        return
    # 2. Extract key metrics for calculation
    ttft_values = [r["ttft_ms"] for r in success_results]
    tps_values = [r["tps"] for r in success_results]
    # Average prompt/max tokens across successful runs, parsed from 'prompt_type'
    avg_prompt_tokens = 0
    avg_gen_tokens = 0
    input_tokens = []
    max_tokens = []
    for r in success_results:
        try:
            # 'prompt_type' has the form "Input:X Output:Y"
            parts = r["prompt_type"].split()
            input_tokens.append(int(parts[0].split(":")[1]))
            max_tokens.append(int(parts[1].split(":")[1]))
        except (KeyError, IndexError, ValueError):
            # Fallback if the format is wrong or keys are missing
            pass
    if input_tokens:
        avg_prompt_tokens = int(np.mean(input_tokens))
    if max_tokens:
        # max_tokens serves as a proxy for average generated tokens
        avg_gen_tokens = int(np.mean(max_tokens))
    # 3. Calculate summary statistics using NumPy
    avg_ttft = np.mean(ttft_values)
    std_ttft = np.std(ttft_values, ddof=1)  # ddof=1 for sample standard deviation
    p95_ttft = np.percentile(ttft_values, 95)
    avg_tps = np.mean(tps_values)
    std_tps = np.std(tps_values, ddof=1)
    # 4. Print summary
    print(f"\nllama.cpp Summary ({mode_name}):")
    print(f"  Avg TTFT: {avg_ttft:.2f}ms (±{std_ttft:.2f}, p95: {p95_ttft:.2f}ms)")
    print(f"  Avg TPS: {avg_tps:.2f} (±{std_tps:.2f})")
    print(f"  Avg Prompt/Max Output Tokens: {avg_prompt_tokens}/{avg_gen_tokens}")
    print("-" * 80)
def main():
    name, config = list(SERVICES.items())[0]
    client = openai.OpenAI(base_url=config["base_url"], api_key=config["api_key"])
    all_final_results = []
    # =========================================================================
    # PHASE 1: Fixed Input/Output (The Hardware Isolation Test)
    # Measures deterministic speed for clean comparison between flags/hardware.
    # =========================================================================
    fixed_mode = FixedMode("Fixed-Prompts", runs=5, max_tokens=128)
    fixed_results = run_standard_benchmark(client, config["model"], fixed_mode)
    all_final_results.extend(fixed_results)
    generate_summary("Fixed-Prompts", fixed_results)
    print("\n" + "="*80)

if __name__ == "__main__":
    main()

Results
Here is the result:
--- Fixed-Prompts (15 runs total) ---
Warmup done.
Run 1/15... TTFT: 239.21ms, TPS: 11.65
Run 2/15... TTFT: 229.69ms, TPS: 11.55
Run 3/15... TTFT: 234.86ms, TPS: 11.60
Run 4/15... TTFT: 220.71ms, TPS: 11.58
Run 5/15... TTFT: 209.01ms, TPS: 11.56
Run 6/15... TTFT: 214.51ms, TPS: 11.57
Run 7/15... TTFT: 236.20ms, TPS: 11.53
Run 8/15... TTFT: 245.57ms, TPS: 11.57
Run 9/15... TTFT: 217.84ms, TPS: 11.65
Run 10/15... TTFT: 217.73ms, TPS: 11.66
Run 11/15... TTFT: 225.34ms, TPS: 11.65
Run 12/15... TTFT: 223.79ms, TPS: 11.50
Run 13/15... TTFT: 208.61ms, TPS: 11.58
Run 14/15... TTFT: 219.19ms, TPS: 11.69
Run 15/15... TTFT: 200.86ms, TPS: 11.59
llama.cpp Summary (Fixed-Prompts):
Avg TTFT: 222.87ms (±12.47, p95: 241.12ms)
Avg TPS: 11.60 (±0.05)
  Avg Prompt/Max Output Tokens: 256/128

Varied prompt: real-world load
In this approach, we use a set of diverse, production-like prompts varying in length, complexity, and output length. The goal is to gain insights into the realistic latency and throughput of an LLM application under actual user load.
Because TTFT scales roughly linearly with prompt length, a varied prompt set captures the real latency distribution (average, p50, p90, and p99) users will experience, which shapes the perceived quality of an application from the end user's perspective.
It also reveals how the system handles a mixed workload and the actual concurrency limits before performance degrades unacceptably. These metrics provide clear estimates for total cost and required infrastructure for anticipated user traffic.
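To sketch why percentiles matter more than averages under mixed loads, here is how p50/p95/p99 can be pulled from a set of per-request TTFT samples (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical TTFT samples (ms) from a mixed short/medium/long prompt set
ttft_ms = [210, 230, 890, 1320, 1560, 1800]
p50, p95, p99 = np.percentile(ttft_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The average of these samples (~1002 ms) hides the shape of the distribution: half the requests finish far faster, while a tail of long prompts is much slower, which is exactly what p95/p99 expose.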
Code
We will reuse all the helper functions from the fixed mode and add just this class to create the varied prompts.
class VariableMode(BenchmarkMode):
    def generate_prompts(self) -> List[Dict[str, Any]]:
        # Variable prompts: short, medium, and long requests
        PROMPTS = [
            "Explain the importance of low-latency networking in cloud computing.",
            "Explain the importance of low-latency networking in cloud computing in about 150 words.",
            "Write a detailed essay summary on the importance of low-latency networking in cloud computing. Aim for 500 words.",
        ]
        prompt_list = []
        for prompt in PROMPTS:
            for _ in range(self.runs):
                prompt_list.append({
                    "prompt": prompt,
                    "max_tokens": self.max_tokens,
                    "mode": f"{self.name}_VariableInput_VariableOutput"
                })
        return prompt_list

We then add this to main() to run the varied prompt benchmark.
    # =========================================================================
    # PHASE 2: Variable Input/Output (The User Experience Test)
    # Measures realistic, non-deterministic performance across prompt complexity.
    # =========================================================================
    variable_mode = VariableMode("Variable-Prompts", runs=5, max_tokens=512)
    variable_results = run_standard_benchmark(client, config["model"], variable_mode)
    all_final_results.extend(variable_results)
    generate_summary("Variable-Prompts", variable_results)

Results
--- Variable-Prompts (15 runs total) ---
Warmup done.
Run 1/15... TTFT: 1562.86ms, TPS: 11.46
Run 2/15... TTFT: 1322.98ms, TPS: 11.49
Run 3/15... TTFT: 1307.48ms, TPS: 11.74
Run 4/15... TTFT: 205.17ms, TPS: 11.60
Run 5/15... TTFT: 1317.97ms, TPS: 11.66
Run 6/15... TTFT: 1689.91ms, TPS: 11.60
Run 7/15... TTFT: 1679.79ms, TPS: 11.82
Run 8/15... TTFT: 227.23ms, TPS: 11.67
Run 9/15... TTFT: 1710.92ms, TPS: 11.57
Run 10/15... TTFT: 887.09ms, TPS: 11.75
Run 11/15... TTFT: 1799.81ms, TPS: 11.57
Run 12/15... TTFT: 1805.09ms, TPS: 11.60
Run 13/15... TTFT: 1810.81ms, TPS: 11.42
Run 14/15... TTFT: 1799.12ms, TPS: 11.64
Run 15/15... TTFT: 1825.04ms, TPS: 11.79
llama.cpp Summary (Variable-Prompts):
Avg TTFT: 1396.75ms (±548.07, p95: 1815.08ms)
Avg TPS: 11.63 (±0.12)
  Avg Prompt/Max Output Tokens: 18/512

Conclusion
As expected, in the varied-prompt test the average, standard deviation, and p95 of TTFT differ greatly from the fixed-prompt test: 1396.75ms (±548.07, p95: 1815.08ms) vs. 222.87ms (±12.47, p95: 241.12ms).
Average TPS is nearly identical (11.63 vs. 11.60), consistent with the decode phase being memory-bandwidth bound regardless of prompt, although the standard deviation is higher (±0.12 vs. ±0.05).